Author: Konstantin Malykhin
Date: 24.10.2020
You are working with the datasets:
Your key goal is to find which subboroughs spend the least money per capita on taxi rides on average weekday evening (7-9pm) and average weekday morining (7-9am).
When solving this task, carefully define what "spend" means in this context and whether you need taxi pickup points, taxi dropoff points or both.
%%capture
pip install pandas numpy geopandas datetime seaborn plotly dash
After installing the modules, restart the kernel !
## Forcing using shapely >= 2.0.0
import os
os.environ['USE_PYGEOS'] = '0'
## For working with dataframes
import pandas as pd
import json
## For working with space distance
import geopandas as gpd
from geopandas.tools import sjoin
## For working with time
from datetime import datetime
## For working with graphs
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sb
## pandas options
pd.set_option('display.max_columns', None)
Before loading the data, put the data folder with required files in the working directory next to the IPython Notebook script !
# NYC population
NYC_pop = pd.read_csv('../data/2010_NYC_Population_by_Census_Tracts.csv')
# Trip data
trips = pd.read_csv('../data/taxi_trips.csv')
# NYC boroughs and subboroughs
NYC_boroughs = gpd.read_file('../data/gisdata/nyc_boroughs.gpkg')
census = gpd.read_file('../data/gisdata/nyc_census_tracts.gpkg')
# Show dataframes and basic infromation
display(NYC_pop.head())
print('Table: NYC_pop' + '\tShape: ' + str(NYC_pop.shape))
display(trips.head())
print('Table: trips' + '\tShape: ' + str(trips.shape))
display(NYC_boroughs.head())
print('Table: NYC_boroughs' + '\tShape: ' + str(NYC_boroughs.shape))
display(census.head())
print('Table: census' + '\tShape: ' + str(census.shape))
| Borough | Year | FIPS County Code | DCP Borough Code | Census Tract | Population | |
|---|---|---|---|---|---|---|
| 0 | Bronx | 2010 | 5 | 2 | 100 | 11091 |
| 1 | Bronx | 2010 | 5 | 2 | 200 | 4334 |
| 2 | Bronx | 2010 | 5 | 2 | 400 | 5503 |
| 3 | Bronx | 2010 | 5 | 2 | 1600 | 5643 |
| 4 | Bronx | 2010 | 5 | 2 | 1900 | 1917 |
Table: NYC_pop Shape: (2168, 6)
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RatecodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2016-03-25T10:13:14Z | 2016-03-25T10:27:05Z | 5 | 1.92 | -73.983742 | 40.780811 | 1 | N | -73.958603 | 40.778423 | 2 | 11.0 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 11.80 |
| 1 | 2 | 2016-03-25T10:13:14Z | 2016-03-25T10:24:35Z | 1 | 1.68 | -73.995758 | 40.764381 | 1 | N | -73.994377 | 40.750801 | 2 | 9.0 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 9.80 |
| 2 | 2 | 2016-03-25T10:13:14Z | 2016-03-25T10:50:46Z | 2 | 8.74 | -73.948990 | 40.788612 | 1 | N | -74.016373 | 40.705788 | 1 | 31.0 | 0.0 | 0.5 | 2.00 | 0.00 | 0.3 | 33.80 |
| 3 | 1 | 2016-03-25T10:13:15Z | 2016-03-25T10:21:22Z | 1 | 1.60 | -73.974365 | 40.742847 | 1 | N | -73.970589 | 40.761047 | 2 | 8.0 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 8.80 |
| 4 | 1 | 2016-03-25T10:13:15Z | 2016-03-25T10:43:53Z | 2 | 10.70 | -73.873055 | 40.774200 | 1 | N | -73.982513 | 40.742699 | 1 | 33.0 | 0.0 | 0.5 | 7.85 | 5.54 | 0.3 | 47.19 |
Table: trips Shape: (2649074, 19)
| bor_subb | NAME | CODE | SUBBOROUGH | FORHIS06 | FORHIS07 | FORHIS08 | FORHIS09 | FORWH06 | FORWH07 | FORWH08 | FORWH09 | HHSIZ1990 | HHSIZ00 | HHSIZ02 | HHSIZ05 | HHSIZ08 | KIDS2000 | KIDS2005 | KIDS2006 | KIDS2007 | KIDS2008 | KIDS2009 | RENT2002 | RENT2005 | RENT2008 | RENTPCT02 | RENTPCT05 | RENTPCT08 | PUBAST90 | PUBAST00 | YRHOM02 | YRHOM05 | YRHOM08 | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 501 | North Shore | 501 | North Shore | 37.0657 | 34.0317 | 27.3767 | 29.3091 | 13.2540 | 11.8768 | 11.1788 | 11.1459 | 2.7146 | 2.7338 | 2.7412 | 2.8010 | 2.6983 | 39.2995 | 43.3788 | 38.4022 | 41.5213 | 39.8390 | 40.3169 | 800 | 900 | 1000 | 21.1119 | 24.8073 | 28.5344 | 47.329128 | 6.005791 | 10.805068 | 12.127853 | 11.547433 | POLYGON ((-74.07838 40.64860, -74.07688 40.648... |
| 1 | 502 | Mid-Island | 502 | Mid-Island | 27.9822 | 18.1193 | 24.0452 | 31.1566 | 20.0616 | 19.8575 | 22.4870 | 17.0371 | 2.8233 | 2.7176 | 2.5405 | 2.6228 | 2.5749 | 36.2234 | 35.7630 | 36.9081 | 37.6798 | 37.2447 | 37.8176 | 650 | 800 | 950 | 32.3615 | 27.2584 | 27.9567 | 35.182323 | 2.287034 | 15.241246 | 15.183111 | 14.682124 | MULTIPOLYGON (((-74.20153 40.61449, -74.19995 ... |
| 2 | 503 | South Shore | 503 | South Shore | 10.7019 | 12.1404 | 9.6890 | 14.6638 | 10.3060 | 12.7699 | 9.3561 | 10.2830 | 3.0547 | 2.8497 | 2.6525 | 2.6121 | 2.6483 | 39.7362 | 42.5232 | 40.3577 | 40.3797 | 40.4820 | 35.3880 | 750 | 775 | 800 | 23.0547 | 20.4146 | 18.1590 | 23.894043 | 1.350208 | 12.704250 | 12.972280 | 13.561487 | MULTIPOLYGON (((-74.18659 40.57708, -74.18370 ... |
| 3 | 401 | Astoria | 401 | Astoria | 52.0961 | 53.9585 | 54.6968 | 47.8050 | 38.3658 | 35.6551 | 32.1289 | 34.6578 | 2.4279 | 2.4995 | 2.3032 | 2.3227 | 2.2746 | 28.4592 | 27.2223 | 25.2556 | 24.8911 | 22.0364 | 17.9996 | 1000 | 1100 | 1400 | 25.6022 | 26.7685 | 28.0467 | 80.533928 | 5.204510 | 12.839167 | 13.377513 | 12.544640 | MULTIPOLYGON (((-73.90386 40.78713, -73.90384 ... |
| 4 | 402 | Sunnyside / Woodside | 402 | Sunnyside/Woodside | 62.7242 | 69.3969 | 67.0897 | 58.2963 | 37.0512 | 31.9057 | 32.3264 | 33.8794 | 2.4646 | 2.6287 | 2.5300 | 2.4993 | 2.4766 | 29.8808 | 28.6841 | 28.1440 | 26.3675 | 29.9032 | 26.2156 | 1000 | 1000 | 1400 | 18.8079 | 22.6752 | 21.3009 | 75.516866 | 2.974139 | 15.387656 | 12.518789 | 12.666909 | POLYGON ((-73.90111 40.76029, -73.90081 40.759... |
Table: NYC_boroughs Shape: (55, 35)
| CTLabel | BoroCode | BoroName | CT2010 | BoroCT2010 | CDEligibil | NTACode | NTAName | PUMA | Shape_Leng | Shape_Area | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 5 | Staten Island | 000900 | 5000900 | I | SI22 | West New Brighton-New Brighton-St. George | 3903 | 7729.016666 | 2.497010e+06 | MULTIPOLYGON (((961685.052 172726.408, 961742.... |
| 1 | 98 | 1 | Manhattan | 009800 | 1009800 | I | MN19 | Turtle Bay-East Midtown | 3808 | 5534.199988 | 1.906016e+06 | MULTIPOLYGON (((993521.297 213744.264, 993499.... |
| 2 | 100 | 1 | Manhattan | 010000 | 1010000 | I | MN19 | Turtle Bay-East Midtown | 3808 | 5692.168417 | 1.860938e+06 | MULTIPOLYGON (((992351.125 214645.421, 992224.... |
| 3 | 102 | 1 | Manhattan | 010200 | 1010200 | I | MN17 | Midtown-Midtown South | 3807 | 5687.802440 | 1.860993e+06 | MULTIPOLYGON (((991458.151 215138.776, 991331.... |
| 4 | 104 | 1 | Manhattan | 010400 | 1010400 | I | MN17 | Midtown-Midtown South | 3807 | 5693.035870 | 1.864600e+06 | MULTIPOLYGON (((990566.378 215633.355, 990440.... |
Table: census Shape: (2168, 12)
%%time
# Transform to geopandas & change projection to 32618 (NYC)
trips = gpd.GeoDataFrame(trips, geometry=gpd.points_from_xy(trips['pickup_longitude'],
trips['pickup_latitude'])).\
set_crs(4326).to_crs(32618)
census = census.to_crs(32618)
NYC_boroughs = NYC_boroughs.to_crs(32618)
# Format date and time to python's datetime format
trips['tpep_pickup_datetime'] = pd.to_datetime(trips['tpep_pickup_datetime']).dt.tz_localize(None)
trips['tpep_dropoff_datetime'] = pd.to_datetime(trips['tpep_dropoff_datetime']).dt.tz_localize(None)
CPU times: user 5.19 s, sys: 1.26 s, total: 6.44 s Wall time: 6.44 s
Since the goal of the task is to count spendings in particular time, let's filter the data - take only weekdays 7-9 am and 7-9 pm.
# Get only neccessary weekdays and time of day
# Get workdays
trips = trips[~trips['tpep_pickup_datetime'].dt.day_name().isin(['Saturday', 'Sunday'])]
## Get 7-9 AM and 7-9 PM
trips = trips[((trips['tpep_pickup_datetime'].dt.time > datetime.strptime('19:00:00', '%H:%M:%S').time()) &
(trips['tpep_pickup_datetime'].dt.time < datetime.strptime('21:00:00', '%H:%M:%S').time()) |
(trips['tpep_pickup_datetime'].dt.time > datetime.strptime('07:00:00', '%H:%M:%S').time()) &
(trips['tpep_pickup_datetime'].dt.time < datetime.strptime('09:00:00', '%H:%M:%S').time()))].\
reset_index(drop=True)
It seems like the data is ready for the analysis.
For the analysis of spendings, it's neccessary to define 'spendings'. In this case, spendings - total amount of money that is spent for taxi trip at the pickup location.
# Create column with 'spendings' per passenger
trips['amount_per_passenger'] = (trips['total_amount'] / trips['passenger_count']).round(2)
Now let's merge data with population to the census dataframe.
# CT2010 to int for mergening
census['CT2010'] = census['CT2010'].astype(int)
# Merge population to census
census_table = census.merge(NYC_pop,
left_on=['BoroName', 'CT2010'], right_on=['Borough', 'Census Tract'],
how='left')[['BoroName', 'CT2010', 'NTAName', 'Census Tract', 'Population', 'geometry']]
# Get only centroids of geometry to avoid double counting
census_table['geometry'] = census_table.centroid
Now we have the dataframe of the census tracts with population. Let's merge census tracts and its population to the subboroughs by spatial join.
%%time
# Spatial join of census data and subboroughs
NYC_boroughs = sjoin(NYC_boroughs[['NAME', 'geometry']],
census_table, how='left', predicate='intersects').\
groupby(['NAME'])[['Population']].sum().reset_index().\
merge(NYC_boroughs[['NAME', 'geometry']], on='NAME', how='left')
# From pandas to geopandas
NYC_boroughs = gpd.GeoDataFrame(NYC_boroughs, geometry=NYC_boroughs['geometry']).set_crs(32618)
CPU times: user 26 ms, sys: 1.03 ms, total: 27 ms Wall time: 25.4 ms
# Get the difference of the population after spatial join
census_table['Population'].sum() - NYC_boroughs['Population'].sum()
301164
Here we got difference 301164 people that were 'lost' after spatial join. Here is the closest method to 'save' as many people as possible and at the same time avoid double counting.
# Plot the subborughs with population
f, ax = plt.subplots(1, figsize=(15, 10))
NYC_boroughs.plot(ax=ax, column='Population', legend=True)
ax.set_axis_off()
ax.set_title('Population of the subboroughs, ppl', size = 20)
# plt.axis('equal')
plt.show()
Now let's count total spending per capita in each subboroughs.
%%time
# Spatial join of trips and subboroughs
NYC_boroughs = sjoin(NYC_boroughs, trips[['geometry', 'total_amount']], how='left').\
reset_index(drop=True).drop(columns=['index_right'])
# Merge population and geometries
NYC_boroughs = NYC_boroughs.groupby('NAME')[['total_amount']].sum().reset_index().\
merge(NYC_boroughs[['NAME', 'geometry', 'Population']].drop_duplicates('NAME'),
on='NAME', how='left').round(2)
# To geopandas
NYC_boroughs = gpd.GeoDataFrame(NYC_boroughs, geometry=NYC_boroughs['geometry']).set_crs(32618)
#
NYC_boroughs['amount_per_capita'] = (NYC_boroughs['total_amount'] / NYC_boroughs['Population'])
CPU times: user 1 s, sys: 7.6 ms, total: 1.01 s Wall time: 1.01 s
# Plot the subboroughs with total amount of taxi trips per capita
f, ax = plt.subplots(1, figsize=(20, 10))
NYC_boroughs.plot(ax=ax, column='amount_per_capita', legend=True)
ax.set_axis_off()
ax.set_title('Total spendings of taxi trips per capita', size = 20)
# plt.axis('equal')
plt.show()
# Show the subboroughs with minimum taxi trips spendings per capita
display(NYC_boroughs.sort_values('amount_per_capita').head(6)[['NAME', 'Population',
'total_amount', 'amount_per_capita']].\
reset_index(drop=True))
# Show the barchart
fig = px.bar(NYC_boroughs.sort_values('amount_per_capita').head(6), y='amount_per_capita',
x='NAME', text='amount_per_capita')
# Update layout
fig.update_layout(
title='Subboroughs with minimum taxi trips spendigs per capita',
xaxis_title='Subborough',
yaxis_title='Amount per capita')
# Update text
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
# Show the figure
fig.show()
| NAME | Population | total_amount | amount_per_capita | |
|---|---|---|---|---|
| 0 | South Shore | 155693 | 0.00 | 0.000000 |
| 1 | Bayside / Little Neck | 116852 | 28.10 | 0.000240 |
| 2 | Pelham Parkway | 123764 | 35.60 | 0.000288 |
| 3 | Bensonhurst | 170985 | 68.06 | 0.000398 |
| 4 | Williamsbridge / Baychester | 148040 | 73.62 | 0.000497 |
| 5 | Coney Island | 117274 | 68.40 | 0.000583 |
P.s. South Shore is included in subboroughs despite there is no any data for taxi rides during the filtered time.
# Make a map
fig = px.choropleth_mapbox(NYC_boroughs,
geojson=NYC_boroughs.to_crs(4326).geometry,
locations=NYC_boroughs.index,
color='amount_per_capita',
hover_name='NAME',
center={'lat': 40.71, 'lon': -74.00},
mapbox_style='carto-positron',
opacity=0.5,
labels={'amount_per_capita':'amount_per_capita'},
zoom=9)
# Get rid of margin
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0},
title='Spendings on taxi trips on weekdays at 7-9 am and at 7-9 pm per capita')
# Show the figure
fig.show()
Thus, after the analysis, it could be seen that in subboroughs such as Bayside / Little Neck, Pelham Parkway, Bensonhurst, Williamsbridge / Baychester and Coney Island, taxi rides spendings on average weekday at 7-9 am and 7-9 pm are minimum among all the subboroughs.